Methods for identifying biological samples

ABSTRACT

The present invention provides methods for marking nucleic acid samples with detectable markers by adding different combinations of marker molecules to each sample. Each sample may be marked with a different combination of two or more marker molecules, each carrying a different tag nucleic acid sequence. The tag nucleic acid sequences may be random sequences that are not naturally occurring in the nucleic acid sample and do not cross hybridize to sequences naturally occurring in the nucleic acid sample. Methods of detecting the combination of tag sequences present in a sample, in parallel with methods of genetic analysis of the sample, are disclosed. Kits containing marker molecules suitable for generating barcoded samples by mixing different combinations of marker molecules into each sample are also disclosed.

RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 60/610,668 filed Sep. 17, 2004, the entire disclosure of which is incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to the field of nucleic acid analysis and to methods for marking samples with an internal detectable marking system. The marking system comprises combinations of two or more marking sequences, allowing a small number of marking sequences to be used to generate a large number of unique combinations.
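The combinatorial point above can be illustrated with a short calculation. The following is a minimal sketch, not part of the disclosed method; the pool sizes (10 and 20 marking sequences) and the function name `barcode_combinations` are assumptions chosen only for illustration.

```python
from itertools import combinations
from math import comb


def barcode_combinations(n_markers, k):
    """Return every k-member combination of marker indices 0..n_markers-1."""
    return list(combinations(range(n_markers), k))


# Hypothetical pool of 10 marking sequences used two at a time.
pairs = barcode_combinations(10, 2)
print(len(pairs))   # 45 distinct two-marker barcodes

# Hypothetical pool of 20 marking sequences used four at a time.
print(comb(20, 4))  # 4845 distinct four-marker barcodes
```

The count grows as the binomial coefficient C(n, k), which is why a modest pool of marking sequences can distinguish a much larger number of samples.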

REFERENCE TO SEQUENCE LISTING

The Sequence Listing submitted on compact disk is hereby incorporated by reference. The file on the disk is named 3697.1seqlist.txt, the file is 41 KB and the date of creation of the compact discs is Sep. 19, 2005. The machine format for the discs is IBM-PC and the operating system compatibility is MS-WINDOWS 2000.

BACKGROUND OF THE INVENTION

Methods of genetic analysis of biological samples typically involve analysis of a liquid aliquot of the sample. The aliquot for analysis is typically transferred from the original source to a new container for subsequent analysis. The new container is typically associated with the original sample and the original source, for example, by a labeling system, whereby the containers are labeled. The movement of an aliquot of a sample to a new container presents an opportunity for the aliquot to be wrongly associated, for example, if the new container is not labeled correctly. Aliquoting and subsequent manipulation steps are also opportunities where contamination may be introduced to the sample, for example, mixing of material from two different biological sources.

SUMMARY OF THE INVENTION

Methods for marking biological samples with a unique combination of tag sequences that can be detected by hybridization are disclosed. A hybridization pattern that is characteristic of the combination of tag sequences in the sample can be used as a barcode to identify the sample. Barcodes are generated by combining marking molecules, for example, marker plasmids or marker adaptors (sometimes called barcode plasmids and barcode adaptors), in combinations of two or more so that samples within a group of samples are uniquely marked. The sample is marked with a known combination of tag sequences that comprises at least two different tag sequences. The tag sequences may be carried on plasmids so that fragments containing the tag sequence may be generated by restriction digestion of the sample containing the plasmid. The fragment containing the tag sequence may be ligated to adaptors comprising priming sites and amplified, for example, by PCR.

In preferred embodiments the tag sequences that form the barcode are amplified in the sample in parallel with the amplification of the marked sample. For example, if the sample is being genotyped the barcode tag sequences are amplified by the same method by which the genomic fragments for genotyping are analyzed. In preferred embodiments this is by the WGSA method of fragmentation with a restriction enzyme, adaptor ligation and amplification of fragments of a selected, limited size range. If the sample is being analyzed for gene expression the barcode tag sequences may be part of a polyadenylated transcript suitable for amplification using a T7 oligo dT primer or part of an un-polyadenylated transcript suitable for reverse transcription using random primers with or without a T7 promoter primer.

Kits comprising marker molecules, including barcode plasmids and barcode adaptors, are disclosed. The marker molecules may be arranged in a format that facilitates barcoding, for example a multiwell plate.

The methods are particularly useful for detection of contamination of one sample by another sample, mis-identification of samples and contamination of one sample with amplicons of another sample.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 shows a schematic of one embodiment. A plasmid containing a barcode sequence is fragmented with a selected restriction enzyme. The barcode sequence is contained within a fragment that is between 250 and 2000 base pairs. Adaptors are ligated to the fragments and the fragment containing the barcode is efficiently amplified.

FIG. 2 shows an example of a possible arrangement of the barcode probe sets on an array. A probe set that is complementary to each barcode is present at two different locations on the array and the barcode probe sets are positioned throughout the array. Visual inspection of the hybridization pattern can distinguish between different barcode combinations.

FIG. 3 shows a schematic of a method of simultaneously detecting a fragment of interest (103) and a tag sequence in a marker molecule (105).

FIG. 4 is a schematic of fragments that are expected when barcode sequences are added as barcode fragments that can be ligated to adaptors.

FIG. 5 shows a schematic of the 100K barcode plasmids. FIG. 5A shows the pFC48 vector and 5B shows the 100K clones.

FIG. 6 shows a schematic of the 500K barcode plasmids. FIG. 6A shows the pFC51 vector and 6B shows the 500K clones.

DETAILED DESCRIPTION OF THE INVENTION

a) General

The present invention has many preferred embodiments and relies on many patents, applications and other references for details known to those of the art. Therefore, when a patent, application, or other reference is cited or repeated below, it should be understood that it is incorporated by reference in its entirety for all purposes as well as for the proposition that is recited.

As used in this application, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. The term “an agent”, for example, includes a plurality of agents, including mixtures thereof.

An individual is not limited to a human being but may also be other organisms including but not limited to mammals, plants, bacteria, or cells derived from any of the above.

Throughout this disclosure, various aspects of this invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

The practice of the present invention may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art. Such conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the example herein below. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, New York, Gait, “Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press, London, Nelson and Cox (2000), Lehninger, Principles of Biochemistry 3rd Ed., W.H. Freeman Pub., New York, N.Y. and Berg et al. (2002) Biochemistry, 5th Ed., W.H. Freeman Pub., New York, N.Y., all of which are herein incorporated in their entirety by reference for all purposes.

The present invention can employ solid substrates, including arrays in some preferred embodiments. Methods and techniques applicable to polymer (including protein) array synthesis have been described in U.S. Ser. No. 09/536,841, WO 00/58516, U.S. Pat. Nos. 5,143,854, 5,242,974, 5,252,743, 5,324,633, 5,384,261, 5,405,783, 5,424,186, 5,451,683, 5,482,867, 5,491,074, 5,527,681, 5,550,215, 5,571,639, 5,578,832, 5,593,839, 5,599,695, 5,624,711, 5,631,734, 5,795,716, 5,831,070, 5,837,832, 5,856,101, 5,858,659, 5,936,324, 5,968,740, 5,974,164, 5,981,185, 5,981,956, 6,025,601, 6,033,860, 6,040,193, 6,090,555, 6,136,269, 6,269,846 and 6,428,752, in PCT Applications Nos. PCT/US99/00730 (International Publication No. WO 99/36760) and PCT/US01/04285 (International Publication No. WO 01/58593), which are all incorporated herein by reference in their entirety for all purposes.

Patents that describe synthesis techniques in specific embodiments include U.S. Pat. Nos. 5,412,087, 6,147,205, 6,262,216, 6,310,189, 5,889,165, and 5,959,098. Nucleic acid arrays are described in many of the above patents, but the same techniques are applied to polypeptide arrays.

Nucleic acid arrays that are useful in the present invention include those that are commercially available from Affymetrix (Santa Clara, Calif.) under the brand name GeneChip®. Example arrays are shown on the website at affymetrix.com.

The present invention also contemplates many uses for polymers attached to solid substrates. These uses include gene expression monitoring, profiling, library screening, genotyping and diagnostics. Gene expression monitoring and profiling methods are shown in U.S. Pat. Nos. 5,800,992, 6,013,449, 6,020,135, 6,033,860, 6,040,138, 6,177,248 and 6,309,822. Genotyping and uses therefor are shown in U.S. Ser. Nos. 10/442,021, 10/013,598 (U.S. Patent Application Publication 20030036069), and U.S. Pat. Nos. 5,856,092, 6,300,063, 5,858,659, 6,284,460, 6,361,947, 6,368,799 and 6,333,179. Other uses are embodied in U.S. Pat. Nos. 5,871,928, 5,902,723, 6,045,996, 5,541,061, and 6,197,506.

The present invention also contemplates sample preparation methods in certain preferred embodiments. Prior to or concurrent with genotyping, the genomic sample may be amplified by a variety of mechanisms, some of which may employ PCR. See, for example, PCR Technology: Principles and Applications for DNA Amplification (Ed. H. A. Erlich, Freeman Press, NY, N.Y., 1992); PCR Protocols: A Guide to Methods and Applications (Eds. Innis, et al., Academic Press, San Diego, Calif., 1990); Mattila et al., Nucleic Acids Res. 19, 4967 (1991); Eckert et al., PCR Methods and Applications 1, 17 (1991); PCR (Eds. McPherson et al., IRL Press, Oxford); and U.S. Pat. Nos. 4,683,202, 4,683,195, 4,800,159, 4,965,188, and 5,333,675, each of which is incorporated herein by reference in its entirety for all purposes. The sample may be amplified on the array. See, for example, U.S. Pat. No. 6,300,070 and U.S. Ser. No. 09/513,300, which are incorporated herein by reference.

Other suitable amplification methods include the ligase chain reaction (LCR) (for example, Wu and Wallace, Genomics 4, 560 (1989), Landegren et al., Science 241, 1077 (1988) and Barringer et al. Gene 89:117 (1990)), transcription amplification (Kwoh et al., Proc. Natl. Acad. Sci. USA 86, 1173 (1989) and WO88/10315), self-sustained sequence replication (Guatelli et al., Proc. Nat. Acad. Sci. USA, 87, 1874 (1990) and WO90/06995), selective amplification of target polynucleotide sequences (U.S. Pat. No. 6,410,276), consensus sequence primed polymerase chain reaction (CP-PCR) (U.S. Pat. No. 4,437,975), arbitrarily primed polymerase chain reaction (AP-PCR) (U.S. Pat. Nos. 5,413,909, 5,861,245) and nucleic acid based sequence amplification (NASBA). (See, U.S. Pat. Nos. 5,409,818, 5,554,517, and 6,063,603, each of which is incorporated herein by reference). Other amplification methods that may be used include: Qbeta Replicase, described in PCT Patent Application No. PCT/US87/00880, isothermal amplification methods such as SDA, described in Walker et al. 1992, Nucleic Acids Res. 20(7):1691-6, 1992, and rolling circle amplification, described in U.S. Pat. No. 5,648,245. Other amplification methods that may be used are described in U.S. Pat. Nos. 5,242,794, 5,494,810, 4,988,617 and in U.S. Ser. No. 09/854,317, each of which is incorporated herein by reference. Other amplification methods that may be used are disclosed in U.S. Patent Application Publication No. 20030143599.

Additional methods of sample preparation and techniques for reducing the complexity of a nucleic acid sample are described in Dong et al., Genome Research 11, 1418 (2001), in U.S. Pat. Nos. 6,361,947, 6,391,592 and U.S. Ser. Nos. 09/916,135, 09/920,491 (U.S. Patent Application Publication 20030096235), U.S. Ser. No. 09/910,292 (U.S. Patent Application Publication 20030082543), and U.S. Ser. No. 10/013,598.

Methods for conducting polynucleotide hybridization assays have been well developed in the art. Hybridization assay procedures and conditions will vary depending on the application and are selected in accordance with the general binding methods known including those referred to in: Maniatis et al. Molecular Cloning: A Laboratory Manual (2nd Ed. Cold Spring Harbor, N.Y., 1989); Berger and Kimmel Methods in Enzymology, Vol. 152, Guide to Molecular Cloning Techniques (Academic Press, Inc., San Diego, Calif., 1987); Young and Davis, P.N.A.S., 80: 1194 (1983). Methods and apparatus for carrying out repeated and controlled hybridization reactions have been described in U.S. Pat. Nos. 5,871,928, 5,874,219, 6,045,996, 6,386,749 and 6,391,623, each of which is incorporated herein by reference.

The present invention also contemplates signal detection of hybridization between ligands in certain preferred embodiments. See U.S. Pat. Nos. 5,143,854, 5,578,832; 5,631,734; 5,834,758; 5,936,324; 5,981,956; 6,025,601; 6,141,096; 6,185,030; 6,201,639; 6,218,803; and 6,225,625, in U.S. Ser. No. 10/389,194 and in PCT Application PCT/US99/06097 (published as WO99/47964), each of which also is hereby incorporated by reference in its entirety for all purposes.

Methods and apparatus for signal detection and processing of intensity data are disclosed in, for example, U.S. Pat. Nos. 5,143,854, 5,547,839, 5,578,832, 5,631,734, 5,800,992, 5,834,758; 5,856,092, 5,902,723, 5,936,324, 5,981,956, 6,025,601, 6,090,555, 6,141,096, 6,185,030, 6,201,639; 6,218,803; and 6,225,625, in U.S. Ser. Nos. 10/389,194, 60/493,495 and in PCT Application PCT/US99/06097 (published as WO99/47964), each of which also is hereby incorporated by reference in its entirety for all purposes.

The practice of the present invention may also employ conventional biology methods, software and systems. Computer software products of the invention typically include a computer readable medium having computer-executable instructions for performing the logic steps of the method of the invention. Suitable computer readable media include floppy disk, CD-ROM/DVD/DVD-ROM, hard-disk drive, flash memory, ROM/RAM, magnetic tapes, etc. The computer executable instructions may be written in a suitable computer language or combination of several languages. Basic computational biology methods are described in, for example, Setubal and Meidanis et al., Introduction to Computational Biology Methods (PWS Publishing Company, Boston, 1997); Salzberg, Searles, Kasif, (Ed.), Computational Methods in Molecular Biology, (Elsevier, Amsterdam, 1998); Rashidi and Buehler, Bioinformatics Basics: Application in Biological Science and Medicine (CRC Press, London, 2000) and Ouellette and Baxevanis Bioinformatics: A Practical Guide for Analysis of Gene and Proteins (Wiley & Sons, Inc., 2nd ed., 2001). See U.S. Pat. No. 6,420,108.

The present invention may also make use of various computer program products and software for a variety of purposes, such as probe design, management of data, analysis, and instrument operation. See, U.S. Pat. Nos. 5,593,839, 5,795,716, 5,733,729, 5,974,164, 6,066,454, 6,090,555, 6,185,561, 6,188,783, 6,223,127, 6,229,911 and 6,308,170.

Additionally, the present invention may have preferred embodiments that include methods for providing genetic information over networks such as the Internet as shown in U.S. Ser. Nos. 10/197,621, 10/063,559 (U.S. Publication No. 20020183936), U.S. Ser. Nos. 10/065,856, 10/065,868, 10/328,818, 10/328,872, 10/423,403, and 60/482,389.

b) Definitions

The term “array” as used herein refers to an intentionally created collection of molecules which can be prepared either synthetically or biosynthetically. The molecules in the array can be identical or different from each other. The array can assume a variety of formats, for example, libraries of soluble molecules; libraries of compounds tethered to resin beads, silica chips, or other solid supports.

The term “barcode” is used to refer to a unique combination of nucleic acid sequences. Each sample can be marked with different combinations of tag sequences that can be detected by hybridization to probes complementary to a plurality of tag sequences. This generates a unique hybridization pattern depending on the tag sequences present in the sample. The pattern serves as a “barcode” in that it uniquely identifies the combination of tags in the sample, thus identifying the sample. In a preferred aspect the probes are part of an array of probes. In a preferred embodiment the barcode comprises a combination of two or more different marker molecules. Each marker molecule includes at least one unique tag sequence (barcode sequence) of at least 15 or at least 20 bases. The tag sequence or sequences are preferably part of a larger nucleic acid, for example a marker or barcode plasmid or within a marker fragment. The barcode may be generated by mixing two or more marker molecules with a nucleic acid sample, for example, a genomic DNA sample from one or more individuals. The marker molecules can be added individually to the sample or they may be added in combinations of two or more.
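As an illustration of how a hybridization pattern can serve as a barcode, the following minimal sketch maps the set of detected tags to a sample name. The signal values, the detection threshold, the tag and sample names, and the functions `call_tags` and `identify_sample` are all hypothetical and are not taken from the disclosure.

```python
def call_tags(signals, threshold):
    """Return the set of tags whose probe-set signal meets the threshold."""
    return frozenset(tag for tag, value in signals.items() if value >= threshold)


def identify_sample(signals, barcode_table, threshold=1000.0):
    """Return (sample name or None, detected tags) for one hybridization."""
    detected = call_tags(signals, threshold)
    return barcode_table.get(detected), detected


# Hypothetical assignment of two-tag barcodes to three samples.
barcode_table = {
    frozenset({"tag1", "tag2"}): "sample_A",
    frozenset({"tag1", "tag3"}): "sample_B",
    frozenset({"tag2", "tag3"}): "sample_C",
}

clean = {"tag1": 5200.0, "tag2": 4800.0, "tag3": 90.0}
print(identify_sample(clean, barcode_table)[0])  # sample_A

# A third unexpected tag matches no assigned barcode, which may indicate
# cross-contamination or a mix-up.
mixed = {"tag1": 5200.0, "tag2": 4800.0, "tag3": 3000.0}
print(identify_sample(mixed, barcode_table)[0])  # None
```

A single fixed threshold is the simplest possible detection rule; an actual assay would likely need a calibrated call based on probe-set intensities.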

Each marker molecule preferably comprises a stretch of 15 to about 200 bases, more preferably 20-60 bases, or 20-40 bases, of nucleic acid tag sequence. The tag sequence is selected to be sequence that does not naturally occur in the nucleic acid sample. For example, often the sample is genomic DNA from a human and the tag sequences are selected by comparing a candidate tag sequence, for example, a random 20 mer, to a database of known sequences to identify sequences that are not significantly homologous to a known human sequence. Preferably for a 20 mer tag the sequence differs from the closest sequence in the genome by at least 2, 3 or 4 bases. Preferably tag sequences are selected so that they will not cross hybridize to known human sequences under selected hybridization conditions. In one aspect the marker molecules comprise 1, 2, 3, 4, 5 or more tag sequences selected from a set of tag sequences. The multiple tag sequences may form a continuous larger tag sequence, for example, 40 bases of tag sequence may be formed from two 20 mer tags. Sets of tag sequences and methods of selecting tag sequences are disclosed in U.S. Pat. No. 6,458,530 and U.S. patent application Ser. No. 09/827,383. In one aspect tag sequences in a set may be all the same length, have melting temperatures that are within the same temperature range, plus or minus 2 to 5° C., and do not cross hybridize to other tags in the set, to the complement of other tags in the set or to sequences in the genome of a selected organism or organisms.
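The mismatch criterion described above (a 20 mer tag differing from the closest genomic sequence by at least 2, 3 or 4 bases) can be sketched as a brute-force screen. This is a toy illustration under stated assumptions: the reference is a short in-memory string rather than a genome database, no gaps are considered, and the function names are hypothetical. A practical screen would use a sequence database search tool instead.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def min_mismatches(tag, reference):
    """Smallest mismatch count between the tag and any reference window."""
    k = len(tag)
    return min(hamming(tag, reference[i:i + k])
               for i in range(len(reference) - k + 1))


def passes_screen(tag, reference, required=2):
    """True if the tag and its reverse complement each differ from every
    same-length reference window by at least `required` bases."""
    return (min_mismatches(tag, reference) >= required
            and min_mismatches(reverse_complement(tag), reference) >= required)


reference = "ACGT" * 10                          # toy 40-base stand-in reference
print(passes_screen("A" * 20, reference))        # True: 15 mismatches at best
print(passes_screen(reference[5:25], reference))  # False: exact match exists
```

Checking the reverse complement as well reflects that either strand of the sample could hybridize to a probe.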

The term “barcode plasmid” refers to a construct that includes a plasmid with at least one tag sequence insert. In preferred embodiments a tag sequence of about 15-200 bases, more preferably 20-40 bases, or 20-60 bases, is cloned into one or more restriction sites of a plasmid. The tag sequence is preferably cloned into the plasmid so that it can be released by digestion with a single enzyme selected from a set of enzymes to generate a restriction fragment of between 200 and 1,000 bases that contains the tag sequence.
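The fragment-size requirement above can be checked in silico. The sketch below is illustrative only: it treats the plasmid as a circular string, places the cut at the start of each recognition site (a simplification of real enzyme cut positions), assumes the tag does not span the sequence origin, and uses an invented plasmid sequence. The EcoRI recognition sequence GAATTC is used as an example enzyme site; the function names are hypothetical.

```python
def cut_positions(plasmid, site):
    """Start index of every recognition site in a circular sequence."""
    n = len(plasmid)
    doubled = plasmid + plasmid[:len(site) - 1]  # lets sites wrap the origin
    return [i for i in range(n) if doubled.startswith(site, i)]


def tag_fragment_length(plasmid, site, tag):
    """Length of the restriction fragment carrying the tag, or None if the
    tag is absent, the enzyme does not cut, or a cut falls inside the tag."""
    n = len(plasmid)
    cuts = cut_positions(plasmid, site)
    start = plasmid.find(tag)
    if start < 0 or not cuts:
        return None
    # Nearest cut at or before the tag, wrapping around the circle if needed.
    upstream = max((c for c in cuts if c <= start), default=max(cuts) - n)
    # Nearest cut after the tag start, again wrapping if needed.
    downstream = min((c for c in cuts if c > start), default=min(cuts) + n)
    if downstream < start + len(tag):
        return None
    return downstream - upstream


tag = "GTCAGTCAGTCAGTCAGTCA"  # invented 20-base tag
plasmid = ("A" * 100 + "GAATTC" + "C" * 150 + tag
           + "C" * 144 + "GAATTC" + "A" * 300)
length = tag_fragment_length(plasmid, "GAATTC", tag)
print(length)                  # 320
print(200 <= length <= 1000)   # True
```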

The term “barcode adaptor” refers to a nucleic acid fragment that comprises one or more tag sequences. Barcode adaptors are shown in FIG. 3. A barcode adaptor may be two synthetic oligonucleotides hybridized together to form an adaptor. The barcode adaptor preferably has at least one single stranded overhang, or “sticky end”, to facilitate ligation. The barcode adaptor may also comprise other sequences, such as priming sites, recognition sites for restriction enzymes or promoter sites for an RNA polymerase. One or more barcode plasmids or barcode adaptors may be added to the nucleic acid sample to mark the sample with a barcode. The barcode may be a combination of one or more barcode plasmids and one or more barcode adaptors.

The term “biomonomer” as used herein refers to a single unit of biopolymer, which can be linked with the same or other biomonomers to form a biopolymer (for example, a single amino acid or nucleotide with two linking groups one or both of which may have removable protecting groups) or a single unit which is not part of a biopolymer. Thus, for example, a nucleotide is a biomonomer within an oligonucleotide biopolymer, and an amino acid is a biomonomer within a protein or peptide biopolymer; avidin, biotin, antibodies, antibody fragments, etc., for example, are also biomonomers.

The term “biopolymer”, sometimes referred to as “biological polymer”, as used herein is intended to mean repeating units of biological or chemical moieties. Representative biopolymers include, but are not limited to, nucleic acids, oligonucleotides, amino acids, proteins, peptides, hormones, oligosaccharides, lipids, glycolipids, lipopolysaccharides, phospholipids, synthetic analogues of the foregoing, including, but not limited to, inverted nucleotides, peptide nucleic acids, Meta-DNA, and combinations of the above.

The term “biopolymer synthesis” as used herein is intended to encompass the synthetic production, both organic and inorganic, of a biopolymer. Related to a biopolymer is a “biomonomer”.

The term “combinatorial synthesis strategy” as used herein refers to an ordered strategy for parallel synthesis of diverse polymer sequences by sequential addition of reagents which may be represented by a reactant matrix and a switch matrix, the product of which is a product matrix. A reactant matrix is an l column by m row matrix of the building blocks to be added. The switch matrix is all or a subset of the binary numbers, preferably ordered, between 1 and m arranged in columns. A “binary strategy” is one in which at least two successive steps illuminate a portion, often half, of a region of interest on the substrate. In a binary synthesis strategy, all possible compounds which can be formed from an ordered set of reactants are formed. In most preferred embodiments, binary synthesis refers to a synthesis strategy which also factors a previous addition step. An example is a strategy in which a switch matrix for a masking strategy halves regions that were previously illuminated, illuminating about half of the previously illuminated region and protecting the remaining half (while also protecting about half of previously protected regions and illuminating about half of previously protected regions). It will be recognized that binary rounds may be interspersed with non-binary rounds and that only a portion of a substrate may be subjected to a binary scheme. A combinatorial “masking” strategy is a synthesis which uses light or other spatially selective deprotecting or activating agents to remove protecting groups from materials for addition of other materials such as amino acids.

The term “complementary” as used herein refers to the hybridization or base pairing between nucleotides or nucleic acids, such as, for instance, between the two strands of a double stranded DNA molecule or between an oligonucleotide primer and a primer binding site on a single stranded nucleic acid to be sequenced or amplified. Complementary nucleotides are, generally, A and T (or A and U), or C and G. Two single stranded RNA or DNA molecules are said to be complementary when the nucleotides of one strand, optimally aligned and compared and with appropriate nucleotide insertions or deletions, pair with at least about 80% of the nucleotides of the other strand, usually at least about 90% to 95%, and more preferably from about 98 to 100%. Alternatively, complementarity exists when an RNA or DNA strand will hybridize under selective hybridization conditions to its complement. Typically, selective hybridization will occur when there is at least about 65% complementarity over a stretch of at least 14 to 25 nucleotides, preferably at least about 75%, more preferably at least about 90% complementarity. See, M. Kanehisa Nucleic Acids Res. 12:203 (1984), incorporated herein by reference.
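The percentage thresholds above can be made concrete with a small calculation. The following minimal sketch computes percent complementarity for two equal-length DNA strands aligned antiparallel without gaps; the definition above also allows insertions and deletions, which this simplified example does not model. The function names and the example sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def percent_complementary(strand_a, strand_b):
    """Percent of positions that base pair when two equal-length strands
    are aligned antiparallel with no gaps."""
    if len(strand_a) != len(strand_b):
        raise ValueError("this sketch only handles equal-length strands")
    partner = reverse_complement(strand_b)
    paired = sum(x == y for x, y in zip(strand_a, partner))
    return 100.0 * paired / len(strand_a)


strand = "ACGTACGTAC"
perfect = reverse_complement(strand)              # GTACGTACGT
print(percent_complementary(strand, perfect))     # 100.0

one_mismatch = perfect[:-1] + "A"                 # GTACGTACGA
print(percent_complementary(strand, one_mismatch))  # 90.0
```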

The term “effective amount” as used herein refers to an amount sufficient to induce a desired result.

The term “genome” as used herein is all the genetic material in the chromosomes of an organism. DNA derived from the genetic material in the chromosomes of a particular organism is genomic DNA. A genomic library is a collection of clones made from a set of randomly generated overlapping DNA fragments representing the entire genome of an organism.

The term “genotype” as used herein refers to the genetic information an individual carries at one or more positions in the genome. A genotype may refer to the information present at a single polymorphism, for example, a single SNP. For example, if a SNP is biallelic and can be either an A or a C then if an individual is homozygous for A at that position the genotype of the SNP is homozygous A or AA. Genotype may also refer to the information present at a plurality of polymorphic positions.

The term “hybridization” as used herein refers to the process in which two single-stranded polynucleotides bind non-covalently to form a stable double-stranded polynucleotide; triple-stranded hybridization is also theoretically possible. The resulting (usually) double-stranded polynucleotide is a “hybrid.” The proportion of the population of polynucleotides that forms stable hybrids is referred to herein as the “degree of hybridization.” Hybridizations are usually performed under stringent conditions, for example, at a salt concentration of no more than about 1 M and a temperature of at least 25° C. For example, conditions of 5×SSPE (750 mM NaCl, 50 mM NaPhosphate, 5 mM EDTA, pH 7.4) and a temperature of 25-30° C. are suitable for allele-specific probe hybridizations or conditions of 100 mM MES, 1 M [Na⁺], 20 mM EDTA, 0.01% Tween-20 and a temperature of 30-50° C., preferably at about 45-50° C. Hybridizations may be performed in the presence of agents such as herring sperm DNA at about 0.1 mg/ml, acetylated BSA at about 0.5 mg/ml. As other factors may affect the stringency of hybridization, including base composition and length of the complementary strands, presence of organic solvents and extent of base mismatching, the combination of parameters is more important than the absolute measure of any one alone. Hybridization conditions suitable for microarrays are described in the Gene Expression Technical Manual, 2004 and the GeneChip Mapping Assay Manual, 2004.

The term “hybridization probes” as used herein are oligonucleotides capable of binding in a base-specific manner to a complementary strand of nucleic acid. Such probes include peptide nucleic acids, as described in Nielsen et al., Science 254, 1497-1500 (1991), LNAs, as described in Koshkin et al. Tetrahedron 54:3607-3630, 1998, and U.S. Pat. No. 6,268,490 and other nucleic acid analogs and nucleic acid mimetics.

The term “hybridizing specifically to” as used herein refers to the binding, duplexing, or hybridizing of a molecule only to a particular nucleotide sequence or sequences under stringent conditions when that sequence is present in a complex mixture (for example, total cellular) DNA or RNA.

The term “initiation biomonomer” or “initiator biomonomer” as used herein is meant to indicate the first biomonomer which is covalently attached via reactive nucleophiles to the surface of the polymer, or the first biomonomer which is attached to a linker or spacer arm attached to the polymer, the linker or spacer arm being attached to the polymer via reactive nucleophiles.

The term “isolated nucleic acid” as used herein means an object species that is the predominant species present (i.e., on a molar basis it is more abundant than any other individual species in the composition). Preferably, an isolated nucleic acid comprises at least about 50, 80 or 90% (on a molar basis) of all macromolecular species present. Most preferably, the object species is purified to essential homogeneity (contaminant species cannot be detected in the composition by conventional detection methods).

The term “mixed population”, sometimes referred to as “complex population”, as used herein refers to any sample containing both desired and undesired nucleic acids. As a non-limiting example, a complex population of nucleic acids may be total genomic DNA, total genomic RNA or a combination thereof. Moreover, a complex population of nucleic acids may have been enriched for a given population but include other undesirable populations. For example, a complex population of nucleic acids may be a sample which has been enriched for desired messenger RNA (mRNA) sequences but still includes some undesired ribosomal RNA sequences (rRNA).

The term “monomer” as used herein refers to any member of the set of molecules that can be joined together to form an oligomer or polymer. The set of monomers useful in the present invention includes, but is not restricted to, for the example of (poly)peptide synthesis, the set of L-amino acids, D-amino acids, or synthetic amino acids. As used herein, “monomer” refers to any member of a basis set for synthesis of an oligomer. For example, dimers of L-amino acids form a basis set of 400 “monomers” for synthesis of polypeptides. Different basis sets of monomers may be used at successive steps in the synthesis of a polymer. The term “monomer” also refers to a chemical subunit that can be combined with a different chemical subunit to form a compound larger than either subunit alone.

The term “nucleic acid library”, sometimes referred to as “array”, as used herein refers to an intentionally created collection of nucleic acids which can be prepared either synthetically or biosynthetically and screened for biological activity in a variety of different formats (for example, libraries of soluble molecules; and libraries of oligos tethered to resin beads, silica chips, or other solid supports). Additionally, the term “array” is meant to include those libraries of nucleic acids which can be prepared by spotting nucleic acids of essentially any length (for example, from 1 to about 1000 nucleotide monomers in length) onto a substrate.

The term “nucleic acid” as used herein refers to a polymeric form of nucleotides of any length, either ribonucleotides, deoxyribonucleotides or peptide nucleic acids (PNAs), that comprise purine and pyrimidine bases, or other natural, chemically or biochemically modified, non-natural, or derivatized nucleotide bases. The backbone of the polynucleotide can comprise sugars and phosphate groups, as may typically be found in RNA or DNA, or modified or substituted sugar or phosphate groups. A polynucleotide may comprise modified nucleotides, such as methylated nucleotides and nucleotide analogs. The sequence of nucleotides may be interrupted by non-nucleotide components. Thus the terms nucleoside, nucleotide, deoxynucleoside and deoxynucleotide generally include analogs such as those described herein. These analogs are those molecules having some structural features in common with a naturally occurring nucleoside or nucleotide such that when incorporated into a nucleic acid or oligonucleoside sequence, they allow hybridization with a naturally occurring nucleic acid sequence in solution. Typically, these analogs are derived from naturally occurring nucleosides and nucleotides by replacing and/or modifying the base, the ribose or the phosphodiester moiety. The changes can be tailor made to stabilize or destabilize hybrid formation or enhance the specificity of hybridization with a complementary nucleic acid sequence as desired.

The term “nucleic acids” as used herein may include any polymer or oligomer of pyrimidine and purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively. See Albert L. Lehninger, PRINCIPLES OF BIOCHEMISTRY, at 793-800 (Worth Pub. 1982). Indeed, the present invention contemplates any deoxyribonucleotide, ribonucleotide or peptide nucleic acid component, and any chemical variants thereof, such as methylated, hydroxymethylated or glucosylated forms of these bases, and the like. The polymers or oligomers may be heterogeneous or homogeneous in composition, and may be isolated from naturally-occurring sources or may be artificially or synthetically produced. In addition, the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states.

The term “oligonucleotide”, sometimes referred to as “polynucleotide”, as used herein refers to a nucleic acid ranging from at least 2, preferably at least 8, and more preferably at least 20 nucleotides in length or a compound that specifically hybridizes to a polynucleotide. Polynucleotides of the present invention include sequences of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) which may be isolated from natural sources, recombinantly produced or artificially synthesized and mimetics thereof. A further example of a polynucleotide of the present invention may be peptide nucleic acid (PNA). The invention also encompasses situations in which there is a nontraditional base pairing such as Hoogsteen base pairing which has been identified in certain tRNA molecules and postulated to exist in a triple helix. “Polynucleotide” and “oligonucleotide” are used interchangeably in this application.

The term “primer” as used herein refers to a single-stranded oligonucleotide capable of acting as a point of initiation for template-directed DNA synthesis under suitable conditions, for example, buffer and temperature, in the presence of four different nucleoside triphosphates and an agent for polymerization, such as, for example, DNA or RNA polymerase or reverse transcriptase. The length of the primer, in any given case, depends on, for example, the intended use of the primer, and generally ranges from 15 to 30 nucleotides. Short primer molecules generally require cooler temperatures to form sufficiently stable hybrid complexes with the template. A primer need not reflect the exact sequence of the template but must be sufficiently complementary to hybridize with such template. The primer site is the area of the template to which a primer hybridizes. The primer pair is a set of primers including a 5′ upstream primer that hybridizes with the 5′ end of the sequence to be amplified and a 3′ downstream primer that hybridizes with the complement of the 3′ end of the sequence to be amplified.

The term “probe” as used herein refers to a surface-immobilized molecule that can be recognized by a particular target. See U.S. Pat. No. 6,582,908 for an example of arrays having all possible combinations of probes with 10, 12, and more bases. Examples of probes that can be investigated by this invention include, but are not restricted to, agonists and antagonists for cell membrane receptors, toxins and venoms, viral epitopes, hormones (for example, opioid peptides, steroids, etc.), hormone receptors, peptides, enzymes, enzyme substrates, cofactors, drugs, lectins, sugars, oligonucleotides, nucleic acids, oligosaccharides, proteins, and monoclonal antibodies.

The terms “solid support”, “support”, and “substrate” as used herein are used interchangeably and refer to a material or group of materials having a rigid or semi-rigid surface or surfaces. In many embodiments, at least one surface of the solid support will be substantially flat, although in some embodiments it may be desirable to physically separate synthesis regions for different compounds with, for example, wells, raised regions, pins, etched trenches, or the like. According to other embodiments, the solid support(s) will take the form of beads, resins, gels, microspheres, or other geometric configurations. See U.S. Pat. No. 5,744,305 for exemplary substrates.

The term “target” as used herein refers to a molecule that has an affinity for a given probe. Targets may be naturally-occurring or man-made molecules. Also, they can be employed in their unaltered state or as aggregates with other species. Targets may be attached, covalently or noncovalently, to a binding member, either directly or via a specific binding substance. Examples of targets which can be employed by this invention include, but are not restricted to, antibodies, cell membrane receptors, monoclonal antibodies and antisera reactive with specific antigenic determinants (such as on viruses, cells or other materials), drugs, oligonucleotides, nucleic acids, peptides, cofactors, lectins, sugars, polysaccharides, cells, cellular membranes, and organelles. Targets are sometimes referred to in the art as anti-probes. As the term targets is used herein, no difference in meaning is intended. A “Probe Target Pair” is formed when two macromolecules have combined through molecular recognition to form a complex.

Molecular Barcodes for Internally Marking and Tracking Samples

Many methods of genetic analysis require analysis of large numbers of different samples. Each sample may be derived from a different individual or a different source and keeping track of sample identity is frequently essential to analysis of results. Samples may become contaminated by other samples, the identity of a sample may be lost or a sample may become misidentified. Methods of integrally marking a sample in a detectable manner are disclosed. In a preferred embodiment the sample is marked by the addition of at least one amplifiable tag sequence and detection of the sequence takes place in parallel with the analysis of the biological samples. Plasmids carrying tag sequences and examples of tag sequences that may be used in the presently disclosed methods have been disclosed in U.S. Patent Publication No. 20040175719, U.S. patent application Ser. No. 09/827,383 and U.S. Pat. No. 6,458,530. Methods of marking samples have been disclosed in, for example, U.S. Patent Pub. No. 20040166520 and U.S. Pat. Nos. 6,544,739, 5,643,728 and 5,451,505.

Technological advances in recent years have provided researchers and clinicians with the ability to process large numbers of biological samples for analysis of genotype. These methods may be used, for example, to diagnose disease or risk of disease, or to identify associations between a phenotype and a genetic region. Genotyping methods may also be used for forensic purposes and for paternity analysis. For most of these genotyping applications it is important that the sample be properly identified as to source of origin and that during subsequent manipulation steps the sample can be correctly identified. Marking the container in which a sample is placed has been one method of marking, but sample mix-ups, such as errors in labeling and cross-contamination between samples, do occur and can be difficult to detect without a marking method that is integrated within the sample itself. The presently disclosed methods provide mechanisms to mark a sample so that sample identities can be checked to prevent misidentification of a sample and to detect cross contamination between samples.

The disclosed methods may be used, for example, for marking and tracking biological samples with known combinations of marker molecules carrying tag sequences. The methods may be particularly useful for tracking samples in high throughput assays, for example, when samples are treated or stored in multiwell plates. In one embodiment different amounts of a plurality of control nucleic acids are added to each of a plurality of genomic samples. The spike-in controls may be amplified along with the sample and analyzed in parallel with analysis of the sample, for example, in a gene expression or genotyping analysis method. In a preferred embodiment the analysis includes a step of hybridization to an array of probes. Samples can also be analyzed to detect the marker molecules by other methods such as PCR. The control nucleic acids in a preferred embodiment are combinatorial sets of plasmids where each plasmid contains a different tag sequence, for example a 40 base pair sequence. In a preferred embodiment the tag sequence is at least 20 bases in length.

In FIG. 1 a barcode plasmid marker molecule is shown. The tag sequence is flanked by Xba I sites. When the plasmid or a sample containing the plasmid is digested with Xba I, two fragments result. The larger fragment shown is about 4 kb and the smaller, which contains the tag sequence that may be used as a component of a barcode, is about 500 base pairs. After adaptor ligation the fragment containing the tag sequence can be efficiently amplified by PCR using a primer to the adaptor sequence. The larger fragment is inefficiently amplified. FIG. 2 shows a schematic of hybridization patterns resulting from the presence of different barcodes in different samples. In the top panel tag sequences A, B and D are present and hybridization is detected at the A, B, and D probe sets but not at the C probe set. In the lower panel the hybridization pattern shows that tags B, C and D are present but not A. The probe sets for the tag sequences are present at different locations of the array and are present in duplicate on the array.

In one aspect (FIG. 3) the barcode is identified by the same process that is used to analyze sequences of interest in the sample. The sample containing genomic DNA (101) with sequence of interest (103) has been marked with a marker plasmid (105) carrying a tag sequence. Restriction sites (indicated by arrows) for a selected enzyme, for example, Sty I, flank the tag sequence and the sequence of interest. Upon digestion with the restriction enzyme a variety of restriction fragments (115), including a fragment that includes the tag sequence (107) and a fragment that includes the sequence of interest (109), are generated. An adaptor is ligated to the restriction fragments to generate adaptor-ligated fragments (117). The adaptor-ligated fragments are amplified using a primer complementary to the adaptor. This reduces the complexity of the sample because only fragments that are in a selected size range (about 200 to 2000 base pairs) are efficiently amplified. The amplified fragments are subjected to an additional fragmentation step followed by labeling. The labeled fragments are then hybridized to an array. The array has probes to analyze the sequence of interest and to detect the tag sequence. In preferred aspects the sequence of interest includes a SNP and the array includes probes to determine the allele or alleles present at that SNP. In preferred aspects the array is a genotyping array such as the GENECHIP® Mapping 100K set (Part Numbers 900517 and 900523) or GENECHIP® Mapping 500K set available from Affymetrix, Inc., Santa Clara.

In some embodiments the barcode method may be used to mark biological samples prior to gene expression analysis. The marker molecules may include a polyA sequence 3′ of the tag sequence and a promoter sequence for a phage polymerase, such as T3, T7 or SP6 RNA polymerase, 5′ of the tag sequence. The marker molecules may be reverse transcribed in parallel with the sample using an oligo dT primer or a random primer to make first strand cDNA. The first strand cDNA can be converted to double stranded cDNA, including a promoter for RNA polymerase, using standard methods. Many RNA copies of the tag sequence may be transcribed using RNA polymerase.

In a preferred aspect the method involves adding a unique combination of at least two different DNA tag molecules of known sequence to each of a plurality of genomic DNA samples so that a different combination of DNA tag molecules is added to each of the samples. The samples are subjected to an amplification and complexity reduction step including fragmentation with a restriction enzyme, adaptor ligation and amplification using a primer complementary to the adaptor. The complexity is reduced because only restriction fragments that fall within a limited size range are efficiently amplified. For example, fragments that are about 200 to 2000 base pairs are efficiently amplified and smaller or larger fragments are not amplified or are poorly amplified. The DNA tag molecules are designed so that the tags are on fragments that will be efficiently amplified during the complexity reduction and amplification step. The reduced complexity sample, including the tags, may then be analyzed by hybridization to an array of probes. The array preferably includes probes to detect each of the different tags. Different combinations of tags should result in a different hybridization pattern that is characteristic of the combination of tags included in the sample.

The methods may be used for tracking samples. Each sample is mixed with a different spiked-in set of molecules containing tag sequences so that each sample gets a barcode, which is a combination of tag sequences, that differs by at least one tag sequence from the barcode in every other sample in the sample set.

In a preferred embodiment samples are marked by the addition of a nucleic acid barcode to the sample. The barcode may be added to the sample during isolation of the sample from the biological source. Alternatively the barcode can be added after isolation of the desired sample from the source, for example, the barcode may be added to a cell lysate, an isolated nucleic acid sample, an isolated genomic DNA sample, or an isolated RNA sample. In a preferred embodiment the sample is a biological sample in solution and a solution of each barcode marker is added so that the barcode marker is mixed into the sample. In preferred aspects the barcode marker includes two or more independent marker molecules, for example, two or more different sequence plasmids or two or more different sequence fragments. When an aliquot of the biological sample is removed it preferably contains at least one copy of each marker molecule and preferably contains a plurality of each marker molecule.

In a preferred embodiment the barcode is amplified when the nucleic acid sample is amplified. The barcode is designed so that the conditions used for amplification of the sample will result in amplification of the barcode. For example, if the sample will be amplified by reverse transcription with an oligo dT primer, the barcode may include a polyA sequence.

In a preferred embodiment the barcodes are detected by hybridization to oligonucleotide probes that are complementary to sequences within the barcode. The probes may be attached to a solid support, for example, a chip, a membrane or a bead.

Marker molecules are preferably used in combinations of 2 or more in each sample. A small number of different tag sequences may be used to generate a large number of different barcodes because the plasmids may be combined in many independent combinations. The barcodes may be detected using probes to the limited number of tag sequences. The barcode in each sample will hybridize to a different combination of the tag probes. In this way a limited number of detection probes may be used. For example, 6 different 2 letter combinations can be made from the letters A, B, C and D (AB, BC, CD, AC, AD, and BD) so 6 samples can be uniquely marked with different combinations of 4 tags. In general the formula for the number of different combinations of K objects from a set of N objects is: N!/[K!(N−K)!]. For example, if there are 20 different marker molecules there are 190 different possible combinations of 2 marker molecules, so 190 different barcodes [20!/(2!(20−2)!) = (20×19)/2 = 190]. Thus 20 tag sequences can be used to mark each of 190 different samples with a unique barcode. Each of the 190 barcodes can be uniquely detected using the same 20 tag sequence detection probes or probe sets. Similarly, a set of 10 marker molecules can be used to make 45 different combinations of 2 marker molecules, 120 different combinations of 3, and 210 different combinations of 4. A set of 20 different marker molecules can be used to make 1140 combinations of 3 or 4845 combinations of 4. A set of 30 different marker molecules can be used to make 435 combinations of 2, 4060 combinations of 3 and 27,405 combinations of 4. In each of these sets all possible combinations can be detected using the same limited set of probes complementary to the tag or barcode sequence in the different marker molecules. For example, the 27,405 different combinations of marker molecules that are possible when a set of 30 different marker molecules is used in combinations of 4 can be detected using 30 probe sets, where each probe set detects the tag or barcode sequence in one of the marker molecules. The probe set comprises one or more probes that are perfectly complementary to at least 20 contiguous bases of tag sequence present in the marker molecule.
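By way of illustration only, the counts given above may be reproduced with a short Python sketch. The function names barcode_count and enumerate_barcodes are hypothetical and are introduced here solely for the example; the sketch assumes that a barcode is an unordered set of distinct tags.

```python
from itertools import combinations
from math import comb


def barcode_count(n_markers: int, k_per_sample: int) -> int:
    """Number of distinct barcodes: N! / [K! (N - K)!]."""
    return comb(n_markers, k_per_sample)


def enumerate_barcodes(tags, k_per_sample):
    """List every barcode as an unordered set of tag names (hypothetical helper)."""
    return [frozenset(c) for c in combinations(tags, k_per_sample)]


if __name__ == "__main__":
    # The six 2-letter barcodes that can be made from tags A, B, C and D.
    print(sorted("".join(sorted(b)) for b in enumerate_barcodes("ABCD", 2)))
    # Counts for several set sizes discussed in the text.
    for n, k in [(20, 2), (10, 3), (30, 4)]:
        print(n, k, barcode_count(n, k))
```

Because each barcode is a set rather than an ordered list, AB and BA are counted once, which is why the combination formula rather than a permutation formula applies.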

The methods are particularly well suited to standard multiwell plate formats. For example, 20 barcode plasmids may be used to uniquely barcode each well of a 96 well plate having 12 columns and 8 rows of wells. One plasmid is assigned to each of the 8 rows and added to every well of that row, and one plasmid is assigned to each of the 12 columns and added to every well of that column. Each well will have a different combination of two of the 20 plasmids, resulting in a different barcode combination for each sample.
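The row and column scheme described above may be sketched as follows. This is a minimal illustration and not a prescribed layout: the plasmid names P01 through P20, the well naming (A1 through H12) and the function plate_barcodes are assumptions made for the example.

```python
from itertools import product

ROWS = "ABCDEFGH"          # 8 rows of a 96 well plate
COLUMNS = range(1, 13)     # 12 columns


def plate_barcodes(row_plasmids, column_plasmids):
    """Map each well to the pair of plasmids it receives (hypothetical helper)."""
    if len(row_plasmids) != len(ROWS) or len(column_plasmids) != len(COLUMNS):
        raise ValueError("need 8 row plasmids and 12 column plasmids")
    layout = {}
    for (row, row_plasmid), (col, col_plasmid) in product(
        zip(ROWS, row_plasmids), zip(COLUMNS, column_plasmids)
    ):
        layout[f"{row}{col}"] = frozenset((row_plasmid, col_plasmid))
    return layout


if __name__ == "__main__":
    # Hypothetical plasmid names: P01-P08 mark rows, P09-P20 mark columns.
    names = [f"P{i:02d}" for i in range(1, 21)]
    layout = plate_barcodes(names[:8], names[8:])
    print(len(layout), len(set(layout.values())))
    print(sorted(layout["A1"]), sorted(layout["H12"]))
```

Each well pairs one row plasmid with one column plasmid, so the 8 × 12 = 96 pairs are all distinct, although they are only a subset of the 190 pairs that 20 plasmids could provide.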

In a preferred embodiment the sample is prepared for hybridization and analysis using the whole genome sample assay (WGSA) as described in Matsuzaki et al., Nature Methods 1: 109-111 (2004). The WGSA method amplifies a reproducible subset of fragments from genomic DNA samples. The sample is fragmented with a restriction enzyme, adaptors are ligated to the fragments and the fragments are amplified by PCR using a primer that is complementary to the adaptor. Only fragments that are within a limited size range, about 200 to 2500 base pairs, are efficiently amplified. The fragments that will be amplified can be predicted by doing an in silico digestion of the genome. The barcode plasmids are preferably constructed so that when the genomic sample containing the barcode plasmid or plasmids is fragmented the barcode will be present on a fragment that will be efficiently amplified, for example, a fragment that is between 200 and 2500 base pairs, and more preferably between 300 and 1,000 base pairs.
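An in silico digestion of the kind mentioned above may be sketched as follows. The sketch is a simplification offered under stated assumptions: it treats the input as a linear sequence, uses the Xba I recognition sequence TCTAGA with cleavage after the first base, and counts only fragments bounded by a restriction site at both ends, on the assumption that both ends must receive an adaptor. The function names digest and amplifiable_fragments are hypothetical.

```python
def digest(sequence, site="TCTAGA", cut_offset=1):
    """Cut a linear sequence at every occurrence of `site` (default: Xba I, T^CTAGA)."""
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[start:end] for start, end in zip(bounds, bounds[1:])]


def amplifiable_fragments(fragments, min_len=200, max_len=2500):
    """Keep internal fragments whose length falls in the efficiently amplified range."""
    internal = fragments[1:-1]  # terminal pieces have only one restriction end
    return [f for f in internal if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    # Toy sequence: three Xba I sites separated by filler of different lengths.
    toy = "A" * 100 + "TCTAGA" + "C" * 500 + "TCTAGA" + "G" * 3000 + "TCTAGA" + "A" * 50
    fragments = digest(toy)
    print([len(f) for f in fragments])
    print([len(f) for f in amplifiable_fragments(fragments)])
```

In this toy case only the fragment of about 500 bases falls within the 200 to 2500 base range, which mirrors the design goal of placing the barcode on a fragment that will be efficiently amplified.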

In a preferred embodiment each of the barcode plasmids is constructed to contain a region that includes 2 or more 20 base tag sequences. The plasmid is designed so that the fragment containing the barcode region is released as a fragment that will be efficiently amplified after digestion with a selected enzyme. In preferred embodiments the fragment containing the barcode region is released after digestion with Xba I, Hind III, Nsp I or Sty I. Plasmids containing barcode regions may be prepared in large batches. Small amounts of each plasmid may be used in each sample, for example, about 50 pg of each marker plasmid per 250 ng genomic DNA. The presence of the barcode may also be detected by PCR. This may be used as a quality control mechanism. The probes that detect the tag sequences preferably do not cross hybridize to genomic DNA, to other tags or tag probes or to other probes of the array. The array preferably has probes or probe sets that uniquely detect each marker molecule or tag sequence. In a preferred embodiment tag sequences are selected so that each hybridizes with similar intensity to the array.

In a preferred embodiment the barcodes are used to mark samples in a genotyping assay as described in U.S. patent application Ser. Nos. 10/880,143 and 10/891,260 and U.S. Patent Pub. Nos. 20040067493 and 20040146890, each of which is incorporated herein by reference in its entirety. Briefly, a genomic sample is digested with a selected restriction enzyme, an adaptor comprising a universal priming site is ligated to the ends of the fragments and the adaptor-ligated fragments are amplified by PCR using the universal priming site. Only fragments that are less than about 2 kb and greater than about 200 base pairs are efficiently amplified, resulting in an enrichment of the regions of the genome that are contained within fragments that are between 200 base pairs and 2,000 base pairs following digestion with the selected enzyme. The amplified fragments are then labeled, for example, with biotin and hybridized to an array comprising allele specific probes for SNPs present within fragments that are 200 to 2000 base pairs. Each allele of each SNP may be interrogated by a plurality of probes. The barcode construct has restriction sites arranged so that when the sample is cleaved with the selected restriction enzyme the barcode region will be within a fragment that is between 400 and 800 base pairs, the adaptors will ligate to the ends of the barcode fragment and the fragment will be amplified during PCR with the universal primer. The barcode fragments will be labeled during the labeling reaction and the array comprises probes to detect the amplified barcode region.

The tags are artificial sequences selected to be absent from the genome being analyzed. The tags are selected so that they do not cross hybridize to the allele specific genotyping probes on the array. Methods of selecting and using tag sequences have been disclosed, see for example, U.S. Pat. No. 6,458,530 and U.S. patent application Ser. Nos. 09/827,383 and 10/619,739 (U.S. Publication No. 20040175719). In preferred embodiments sequences are selected to be used as internal controls that can be spiked into a sample at known levels and detected by hybridization. Preferably the sequences are not closely related to sequences in the genome of interest and more preferably the sequences that are to be used as controls are not similar to sequences in any genome that may be analyzed. Such sequences, which may be referred to as “alien” or “antigenomic”, may be generated by a computer, for example randomly generated sequences, and checked against databases of available sequences, for example, the GenBank database, to eliminate sequences that may cross hybridize to probes for known sequences. See also WO2004064482. In preferred embodiments the hybridization properties of the tag sequences and tag probes are selected to behave in a manner that is similar to the naturally occurring sequences that will be analyzed. When designing probes for naturally occurring sequences the sequence itself imposes constraints on the choice of probe and as a result on the behavior of the probe. For example, in a genotyping assay using allele specific probes for each allele of a selected SNP, the probes must correspond to the region surrounding and including the SNP. These probes may not have the optimal hybridization properties. Tag probes may be selected to have hybridization properties that are similar to the probes of the array that are directed to the genome of interest.

In one embodiment the probe sets for each barcode are distributed so that they are present at different locations on the array. In one embodiment the probe set for each barcode may be present in duplicate on the array but in different locations (FIG. 2).

In one embodiment the barcode sequences are spiked in as barcode adaptors that may be ligated to genomic fragments (FIG. 4). Genomic DNA fragments (100) are mixed with a primer adaptor sequence (110) that contains a primer site and barcode adaptors (120 and 130) that each contain a different barcode sequence. The primer adaptor may be added at amounts that are significantly higher than the barcode adaptors, for example, the primer adaptor may be added in amounts that are about 1,000 times the amount of each of the barcode adaptors. The barcode adaptors will only be ligated to a subset of the fragments. The barcode adaptors will then be amplified along with the genomic fragment that they are ligated to. The WGSA method fragments genomic DNA with a restriction enzyme and then adaptor sequences are ligated to the ends of the fragments. Barcode adaptors may be added in during the ligation step. The barcode sequences will be ligated to the genomic fragments and then to the adaptor. The barcode sequence will then be amplified along with the genomic fragment.

In a preferred embodiment about 50 pg of each marker molecule may be added to about 250 ng of genomic DNA, allowing easy identification of the barcodes following target amplification and hybridization to arrays. A standard plasmid miniprep with yield of about 10 μg provides enough plasmid for about 200,000 assays. In a preferred embodiment preparation and storage of the barcode plasmids or sequences is in an area that is free of genomic DNA samples to avoid contamination of the barcode plasmids. Barcode plasmids may be stored in a multiwell plate format. For example, plasmid solutions at a concentration of 50 pg/μl may be combined in pairs using multi-channel pipets or an automated liquid handling device to yield a final concentration of 25 pg/μl of each plasmid. Care should be taken to prevent cross contamination of barcode plasmids or contamination of barcode plasmids with genomic DNA samples or amplicons.
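The quantities in the preceding paragraph follow from simple arithmetic, which may be checked with the sketch below. The function names are hypothetical, and the sketch assumes that paired plasmid solutions are combined in equal volumes, which is what halving 50 pg/μl to 25 pg/μl implies.

```python
PG_PER_UG = 1_000_000  # picograms per microgram


def assays_per_prep(prep_yield_ug, pg_per_assay=50):
    """Number of assays one plasmid preparation can supply."""
    return int(prep_yield_ug * PG_PER_UG // pg_per_assay)


def pooled_concentration(stock_pg_per_ul, n_plasmids=2):
    """Concentration of each plasmid after pooling equal volumes of n stocks."""
    return stock_pg_per_ul / n_plasmids


def marker_to_genomic_ratio(marker_pg, genomic_ng):
    """Mass of one marker plasmid relative to the genomic DNA in the sample."""
    return marker_pg / (genomic_ng * 1000)


if __name__ == "__main__":
    print(assays_per_prep(10))                 # 10 ug miniprep at 50 pg per assay
    print(pooled_concentration(50))            # two 50 pg/ul stocks pooled 1:1
    print(marker_to_genomic_ratio(50, 250))    # 50 pg marker in 250 ng genomic DNA
```

At 50 pg per 250 ng, each marker plasmid represents about one part in 5,000 of the genomic DNA by mass.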

In preferred embodiments the results of hybridization to barcode probe sets are included in a report generated by a computer after analysis of a hybridization pattern. Computer implemented methods may be used to analyze the hybridization pattern, to identify the tag sequences that are present and to compare this to a database of barcodes to identify the sample.
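One possible form of such a computer implemented comparison is sketched below. It is illustrative only: the dictionary barcode_db, the sample identifiers and the function identify_sample are hypothetical, and reporting barcodes that are wholly contained in the detected tags is an assumption about how mixing of two samples might be flagged.

```python
def identify_sample(detected_tags, barcode_db):
    """Compare the set of detected tags with a database of assigned barcodes.

    Returns the sample whose barcode matches exactly (or None), together with
    any samples whose whole barcode is a proper subset of the detected tags,
    which may indicate that material from more than one sample is present.
    """
    detected = frozenset(detected_tags)
    exact = [s for s, tags in barcode_db.items() if frozenset(tags) == detected]
    contained = sorted(s for s, tags in barcode_db.items() if frozenset(tags) < detected)
    return {"match": exact[0] if exact else None, "contained_barcodes": contained}


if __name__ == "__main__":
    # Hypothetical database: three samples, each marked with two of four tags.
    barcode_db = {"S1": {"A", "B"}, "S2": {"B", "C"}, "S3": {"C", "D"}}
    print(identify_sample({"A", "B"}, barcode_db))
    print(identify_sample({"A", "B", "C"}, barcode_db))
```

The second call finds no exact match but reports that the barcodes of both S1 and S2 are present, which is the kind of result that would prompt a check for cross contamination.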

In one embodiment the marker molecule plasmids are added directly to the samples without linearization. In another embodiment the plasmids may be linearized or fragmented prior to being added to the sample. Preferably the plasmid is not fragmented in the region to be amplified for barcode analysis, for example, not at or between the Xba I sites if Xba I will be used for fragmentation in the subsequent detection assay.

In one embodiment a plurality of probes for the barcode sequences is screened to identify probes that perform well in a complex background. Many probes may be tested for each tag sequence and a representative set of probes may be selected. The probes may be selected to provide optimal hybridization or they may be selected on the basis of other criteria, such as detectable hybridization over a broad range of conditions and samples.

In one embodiment a set of marker molecules is provided as a kit. The kit may include a plurality of marker molecules that vary in the tag sequence they each carry. In one embodiment the marker molecules may be identical except for the variable barcode sequence. The kit may include, for example, 5-100, 5-50, 10-20, 20-50 or 50-100 different marker molecules. The marker molecules may be provided in separate containers or they may be provided in a multiwell format, for example, a microtitre plate format. Each well may contain a different marker molecule or a different known combination of marker molecules and the storage format may facilitate use of a liquid transport device that accesses multiple wells simultaneously, for example, a multi-channel pipet.

EXAMPLES

Construction of barcode plasmids to be used as marker molecules: A set of 20 barcode plasmids was constructed. First, a vector, pFC48 (SEQ ID NO. 43), was constructed. The Xho I and Nhe I barcode-cloning sites are at positions 544-549 and 964-969. The ampicillin-resistant, pUC-based plasmid is carried in the E. coli strain FC240. The plasmid has a polyA sequence downstream, as well as the T3, SP6, and T7 transcriptional promoters, all of which may be used for gene expression analysis barcoding embodiments. To construct the barcode plasmids, phosphorylated oligo adaptors were cloned into the Xho I-Nhe I sites of the vector. The resulting 20 plasmids differ from each other only in the 40 bp tag sequence, each of which is composed of tandem GenFlex 20-mer tags (see Table 1). Flanking each of the barcodes is a common Spe I restriction enzyme recognition site to allow identification of barcode clones, because the vector lacks a Spe I site. The other distinguishing feature of the plasmids is the presence of dual Xba I and Hind III restriction enzyme recognition sites, positioned such that treatment of the plasmids with either Xba I or Hind III cuts the plasmids into two pieces of approximately 500 base pairs and 4100 base pairs. The 500 base pair fragments are readily amplified by the 100K mapping assay, which includes the steps of restriction enzyme digestion with Xba I or Hind III, adaptor ligation with an adaptor containing a universal priming site and PCR amplification using a primer to the universal priming site.

SEQ ID NO. 44 shows the sequence of 100K barcode A, as an example. The Xho I and Nhe I barcode-cloning sites are at positions 544-549 and 596-601, and the 40-base barcode sequence is at positions 556-595. The other 100K barcode plasmids have the same sequence, except for the 40-base barcode.

The columns of Table 1 are as follows: column 1 is the name of the barcode sequence, column 2 gives the barcode sequence, column 3 is the SEQ ID NO for the barcode sequence, and column 4 is the quality control primer sequence corresponding to the barcode sequence.

Using Barcode Plasmids in Multiwell Plates

For every 250 ng of genomic DNA (5 μL), add 2 μL of solution from a well of the barcode plate; thus the genomic sample is now irreversibly barcoded with 50 pg of each of the two barcode plasmids in that well. Subject the sample to genotyping using the Affymetrix 100K assay as described in the 100K Mapping Assay Manual, available from Affymetrix, Santa Clara. The barcode sequences are amplified and labeled along with the genomic restriction fragments. The 100K Mapping Arrays have 8 probe pairs (perfect match and mismatch) for each of the 20 barcodes. These probes were chosen to flank the central position of the 40 bp sequence at regular intervals; some of the barcodes have only antisense probes, and some have sense and antisense, as indicated in the 100K library files. No screening or probe selection was performed, so there is variation in probe intensity within a probe set as well as between probe sets.

Hybridization results for the barcodes have been measured by two methods, both developed based upon calculation of the median intensity of the perfect match probes. First, the GDAS (GeneChip DNA Analysis Software, available from Affymetrix, Inc.) report can be configured to show presence/absence of the barcode based upon intensities above a certain threshold. A safe threshold would be 5000 PM median intensity, which would allow a correct present/absent call in every experiment done to date. The advantage of this GDAS threshold report method is that it is convenient for the user; however, it gives a present/absent answer and does not readily allow for the detection of trace cross-contamination. A second output method is to report the actual PM median intensity. This can be done, for example, using a special file in the GDAS folder. The advantage to this second method is that it allows the user more control over the barcode results, including the ability to detect cross-contamination from one sample to another. In preferred aspects the methods are capable of detecting cross contamination at very low levels of contamination, for example, 0.4 to 2%, 2-5%, or 5-10% contamination. For example, if a contaminated sample is 95% a first sample and 5% a second sample the first sample is contaminated by the second at 5%. Higher levels of contamination, greater than 10%, may also be detected. The two methods may be combined by having the computer system report both the actual barcode intensities as well as a present/absent call, based upon user-tunable thresholds.
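The combined report described above, giving both the PM median intensity and a present/absent call, may be sketched as follows. This is not the GDAS implementation; the function call_barcodes, the input format and the intensity values are hypothetical, and only the default threshold of 5000 is taken from the text.

```python
from statistics import median


def call_barcodes(pm_intensities, threshold=5000.0):
    """Report (PM median intensity, present?) for each barcode probe set.

    `pm_intensities` maps a barcode name to the perfect match probe
    intensities of its probe set; `threshold` is user-tunable.
    """
    report = {}
    for barcode, values in pm_intensities.items():
        pm_median = median(values)
        report[barcode] = (pm_median, pm_median > threshold)
    return report


if __name__ == "__main__":
    # Invented intensities for two probe sets of 8 perfect match probes each.
    example = {
        "A": [9000, 12000, 7000, 15000, 11000, 8000, 10000, 13000],
        "B": [300, 450, 500, 350, 400, 600, 250, 550],
    }
    for name, (pm_median, present) in call_barcodes(example).items():
        print(name, pm_median, "present" if present else "absent")
```

Keeping the median in the report alongside the call leaves the user free to inspect barcodes that fall below the threshold but above background, which is where trace cross-contamination would be expected to appear.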

Testing Barcodes

For each 100K barcode a unique, barcode-specific PCR primer was designed and tested. This quality control PCR permits a quick, sensitive check of a sample to determine which barcodes are present. For example, if the barcodes detected on the array differ from the expected barcodes, the researcher could use a small aliquot of the original archived, unamplified, barcoded sample as a PCR template for the expected and observed barcodes. The amount of barcode present in 1 ng of archived, barcoded genomic sample is sufficient template in a standard PCR to give a clear present/absent signal which may be detected as a band on a gel.

The quality control primers are given in Table 1. All are designed to work with the common primer 236m13f, the sequence of which is: aacgccagggttttcccagt (SEQ ID NO. 21). Standard PCR conditions, with an annealing temperature of 55° C. and extension times of 30 sec., have been used. Primers 236m13f and 235m13r (sequence: caggaaacagctatgaccatg) (SEQ ID NO. 22) may be used in a positive control amplification reaction to amplify a 753 bp product from each of the barcode plasmids. Likewise, the same PCRs can be done to test manufactured barcode preparations for the expected barcodes. In this case a sampling of templates/primers may be used.
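As an illustration of how a barcode-specific primer relates to its barcode, the following Python sketch checks where a QC primer would anneal within the listed barcode strand. For entry A of Table 1, the QC primer is the reverse complement of an internal stretch of the barcode; this is an observation from the listed sequences, not a design rule stated in the text, and it should be checked entry by entry rather than assumed. The helper names are hypothetical.

```python
# Complement table for DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def primer_site(barcode, qc_primer):
    """Return the 0-based start of the primer's binding site on the listed
    barcode strand, or -1 if the primer's reverse complement is absent."""
    return barcode.upper().find(reverse_complement(qc_primer).upper())


# Barcode A and its QC primer, copied from Table 1.
barcode_a = "TGACATTATTGGTTCGGAGCTGACTATCTTGGTGTCGCGC"
qc_primer_a = "CCAAGATAGTCAGCTCCGAACCA"

site = primer_site(barcode_a, qc_primer_a)
print("binding site start:", site)
```

A result of -1 for a given entry would indicate that the primer does not match the listed barcode strand exactly in this orientation and deserves a closer look before use.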

30 Barcode Plasmids for 500K Mapping Assay

To accommodate the restriction sites used in the 500K mapping assay, as well as to provide for future possible enzyme fractions, the vector pFC51 was created. This vector has dual restriction enzyme recognition sites for the following 14 enzymes: Xba I, Hind III, Sty I, Nsp I, BsaJ I, Tsp45 I, Apo I, Sau3A I, HinF I, Tse I, Sau96 I, Mse I, BssK I, and PspG I. A fifteenth enzyme fraction is the enzyme pair Msp I-Ase I. An approximately 1750 base pair human genomic fragment separates the Xho I-Nhe I sites to facilitate cloning by allowing better differentiation between uncut, single-cut, and double-cut (desired) vector. The vector used to clone the 500K barcodes, pFC51 (SEQ ID NO. 45), is an ampicillin-resistant, pUC-based plasmid. The ˜1750 n's indicate a random human Xho-Nhe genomic fragment used as a stuffer to aid in cloning by allowing better differentiation between uncut, single-cut, and double-cut (desired) vector. The rest of the sequence between the BamH I and EcoR I sites consists of synthetic, GenFlex Tag-derived sequence and was synthesized. There is a polyA downstream sequence, as well as T3, SP6, and T7 transcriptional promoters. The Xho I and Nhe I cloning sites are at positions 422-427 and 2162-2167, respectively. The vector pFC51 is carried in the strain FC243.

The final 500K barcode plasmids were constructed by ligating phosphorylated oligo adaptors encoding 40-bp tandem Tag sequences into the Xho/Nhe-digested pFC51 vector. All 30 clones were sequence-verified and stored as glycerol stocks. Ten of the 500K clones contain the same barcodes as the corresponding 100K clones (A, C, D, E, H, I, M, N, R, and S), but the other ten 100K clones were not suitable for 500K due to the presence of restriction sites (Sau3A I, HinF I, etc.) in the barcodes. The other twenty 500K barcodes are named 2.01 through 2.20. These plasmids should function for 10K, 100K, or 500K assays, as well as in future assays that utilize the above-named enzymes. As with the 100K clones, QC primers were designed; see Table 2 for the QC primer sequences. The sequence of 500K barcode plasmid 2.01 (SEQ ID NO. 46) is provided as an example. The Xho I barcode cloning site is at positions 422-427 and the Nhe I cloning site is at positions 474-480. The 40-base barcode sequence is at positions 434-473. The remaining 29 500K barcode plasmids have the same sequence, except for the 40-base barcode.

Because there are thirty 500K clones, it is straightforward to make more than 96 barcode combinations. One way to proceed would be to use 20 barcode plasmids to make 96 pairs as described above and make many replica plates of those 96 pairs. To one set of plates add a 21st plasmid to every well for a total of 3 barcode plasmids per well; to a second set of plates add a 22nd plasmid to every well; continuing this way would create 960 different combinations of 3 plasmids per well. Each well would be distinct, and each plate could be traced on the basis of barcodes 21-30. Additional plates could be created either by adding more combinations per well or by making additional barcodes.
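The arithmetic behind the 960 combinations can be sketched in Python as follows. The sketch assumes the 96 pairs are formed as 12 column markers by 8 row markers on a 96-well plate, as in the plate-marking scheme recited in the claims; the marker names (bc01 to bc30) and the function name are placeholders introduced here.

```python
from itertools import product


def build_plates(column_markers, row_markers, plate_markers):
    """Map (plate, row, column) -> frozenset of the three markers in that well.

    Each well receives one column marker, one row marker, and the single
    plate-level marker added to every well of that plate.
    """
    layout = {}
    for p, plate_marker in enumerate(plate_markers, start=1):
        for (r, row_marker), (c, col_marker) in product(
            enumerate(row_markers, start=1),
            enumerate(column_markers, start=1),
        ):
            layout[(p, r, c)] = frozenset((row_marker, col_marker, plate_marker))
    return layout


# Hypothetical names for the thirty barcode plasmids.
markers = [f"bc{i:02d}" for i in range(1, 31)]

layout = build_plates(
    column_markers=markers[:12],    # 12 column markers (assumed split)
    row_markers=markers[12:20],     # 8 row markers (assumed split)
    plate_markers=markers[20:30],   # barcodes 21-30, one per plate set
)

print(len(layout))                  # number of wells across all plates
print(len(set(layout.values())))    # number of distinct combinations
```

Because the column, row, and plate markers are drawn from three disjoint groups, every well carries a distinct set of three markers, and the plate-level marker alone identifies which plate a sample came from.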

The columns of Table 2 are as follows: column 1 is the name of the barcode sequence, column 2 gives the barcode sequence, column 3 is the SEQ ID NO for the barcode sequence, column 4 is the quality control primer sequence corresponding to the barcode sequence.

CONCLUSION

It is to be understood that the above description is intended to be illustrative and not restrictive. Many variations of the invention will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. All cited references, including patent and non-patent literature, are incorporated herewith by reference in their entireties for all purposes.

TABLE 1

1     2                                         3              4
A     TGACATTATTGGTTCGGAGCTGACTATCTTGGTGTCGCGC  SEQ ID NO: 1   CCAAGATAGTCAGCTCCGAACCA (SEQ ID NO: 47)
B     TACCGCTGTGTGTATGTCGCTGCCGCTCTTTATGCTACGG  SEQ ID NO: 2   GCGGCAGCGACATACACACA (SEQ ID NO: 48)
C     CTGCGCTCTAGTTACATCATGCTCTCATTAGTTAGTAGGC  SEQ ID NO: 3   TGAGAGCATGATGTAACTAGAGCGC (SEQ ID NO: 49)
D     GCTCGTTATGTATGTAGACGGTTGTATCAATGTGTGCGAC  SEQ ID NO: 4   CGCACACATTGATACAACCGTCTAC (SEQ ID NO: 50)
E     GTCGAATATCTCTGTGTGAGGGTATCTTCATCTGTGGAGC  SEQ ID NO: 5   CCACAGATGAAGATACCCTCACACA (SEQ ID NO: 51)
F     GAGTCTCTGTACTGTGAGCTTAGTCCAGTTGATTAGTGGC  SEQ ID NO: 6   CGCCACTAATCAACTGGACTAAGCT (SEQ ID NO: 52)
G     TCGCGTCATTTATCGAATCGTGAGGTGCTTTGTACTAACG  SEQ ID NO: 7   CCTCACGATTCGATAAATGACGC (SEQ ID NO: 53)
H     GAGTTATATCTGTGCTACGGGCGTAATCTCTGTCTAGCTC  SEQ ID NO: 8   CGCCCGTAGCACAGATATAACTCAC (SEQ ID NO: 54)
I     GCGTGTACTGTATCTCGTGCGCGTGTAATCTGTCTCTCCG  SEQ ID NO: 9   CACGCGCACGAGATACAGTACAC (SEQ ID NO: 55)
J     GATCTGATTACGTCTCTCGCCGCATATCTACGTCTCTAGG  SEQ ID NO: 10  CGGCGAGAGACGTAATCAGATCAC (SEQ ID NO: 56)
K     AGTGATCGCACCTTATCTCGGTCTGACTTGAGTTACATGG  SEQ ID NO: 11  CAGACCGAGATAAGGTGCGATCA (SEQ ID NO: 57)
L     GTGCTACTTGATCTACATGGGCTTACTATTCATACTGCCG  SEQ ID NO: 12  CGGCAGTATGAATAGTAAGCCCATG (SEQ ID NO: 58)
M     GCTCTACGTTCATATCATGGCCTATCGCTCTATATCTGGG  SEQ ID NO: 13  GCGATAGGCCATGATATGAACGT (SEQ ID NO: 59)
N     CGTATCGTGCTACCTGCTAGGAGTGTCCTGTACGTTAGCC  SEQ ID NO: 14  GGACACTCCTAGCAGGTAGCACGA (SEQ ID NO: 60)
O     ATATGTACGAAGCTAGTGGCTATAGTTATTGACGCTGCGG  SEQ ID NO: 15  CGCAGCGTCAATAACTATAGCCAC (SEQ ID NO: 61)
P     TTATGCTTTCTACGCCGGGCTGTCTATGTTTACAGCGGGC  SEQ ID NO: 16  AGCCCGGCGTAGAAAGCAT (SEQ ID NO: 62)
Q     TGCTTATCTTTACACCGGCGTGTAATGATTTCCACGCTGG  SEQ ID NO: 17  GGAAATCATTACACGCCGGTG (SEQ ID NO: 63)
R     TCTAATCATTTACGACGCGGTGATACTATTGTCGTCGGGC  SEQ ID NO: 18  CGACAATAGTATCACCGCGTCGT (SEQ ID NO: 64)
S     TATACTTATTGAGGTCGCGGTCTGACTTTAGATGTCGGGC  SEQ ID NO: 19  AGTCAGACCGCGACCTCAATAAG (SEQ ID NO: 65)
T     CATGACTATTAACCTAGCGGGGTCTTGCCAACGTCTGTGA  SEQ ID NO: 20  GCAAGACCCCGCTAGGTTAATAGTC (SEQ ID NO: 66)

TABLE 2

1     2                                         3              4
2.01  CGTTATCAACCTCCGTCCGAGCAGTAATTTCAATCGCGTG  SEQ ID NO: 23  TGCTCGGACGCAGCTTGATA (SEQ ID NO: 67)
2.02  CCTTATCCACCTGAGTGAGTGAGGTAGTTTCCACGCTATG  SEQ ID NO: 24  GTGGAAACTACCTCACTCACTCAGG (SEQ ID NO: 68)
2.03  CGTGTTCAAAGCGCGTACCTGCGAATAGTTCCACGTCTGG  SEQ ID NO: 25  GTGGAACTATTCGCAGGTACGC (SEQ ID NO: 69)
2.04  CGCCTTAGACACGTCGTAGACTACTGAGTTACAGTCTGAC  SEQ ID NO: 26  AGTAGTCTACGACGTGTCTAAGGCG (SEQ ID NO: 70)
2.05  CGGTCTTATACCACTGTAGAGCGATGACTGATAATACACG  SEQ ID NO: 27  TCGCTCTACAGTGGTATAAGACCGA (SEQ ID NO: 71)
2.06  CGCTGGATAATCACCTGAGGCGATGCACTGTCATACGATA  SEQ ID NO: 28  CATCGCCTCAGGTGATTATCCA (SEQ ID NO: 72)
2.07  GAGGATGTTACCACTCTGACCGACACGATGGTGCAACTGT  SEQ ID NO: 29  CGTGTCGGTCAGAGTGGTAACATC (SEQ ID NO: 73)
2.08  GATAATGTTACCATACGCGCCGATGTCATCTGGCTACGGT  SEQ ID NO: 30  CATCGGCGCGTATGGTAACAT (SEQ ID NO: 74)
2.09  TTAGTATGTTTCACACGGCGCATATAGCTCTAGTATCCGC  SEQ ID NO: 31  GCTATATGCGCCGTGTGAAACAT (SEQ ID NO: 75)
2.10  GTACTAGATACTCACATCGGCAGAACCTGATATGCTCGCG  SEQ ID NO: 32  CAGGTTCTGCCGATGTGAGTATCT (SEQ ID NO: 76)
2.11  GTTCTTCATTCTACGCACGGATGAACATCTATCGCTCGCT  SEQ ID NO: 33  GATGTTCATCCGTGCGTAGAATG (SEQ ID NO: 77)
2.12  CTACACTATTCTACACCTCGCATGAGACTGTACTAAGCGT  SEQ ID NO: 34  CGCTTAGTACAGTCTCATGCGAGG (SEQ ID NO: 78)
2.13  TTGAATGGTTTCAATCGCGGATATGACTGGAATAGCCGTG  SEQ ID NO: 35  CAGTCATATCCGCGATTGAAACC (SEQ ID NO: 79)
2.14  AGAAGCTATACTATCGCACCAGCAGAACTCTATACACCTG  SEQ ID NO: 36  GTGTATAGAGTTCTGCTGGTGCGAT (SEQ ID NO: 80)
2.15  CTGCAATTATCTACTCTGCGAGTACAATGCCATACGCTCT  SEQ ID NO: 37  CGTATGGCATTGTACTCGCAGAGT (SEQ ID NO: 81)
2.16  CGCACTTCAACAATCGTGTAAGTAGACGTGCATAGCAGTT  SEQ ID NO: 38  CTGCTATGCACGTCTACTTACACGA (SEQ ID NO: 82)
2.17  CGGCTATGTACGACGTGCTACGCTGACCTGTCTAACGTAT  SEQ ID NO: 39  GTCAGCGTAGCACGTCGTACATAG (SEQ ID NO: 83)
2.18  GCGGCTAATTCGACGCTCTAGCCCGCGCTTCATAAGTGTA  SEQ ID NO: 40  CGGGCTACAGCGTCGAATTAG (SEQ ID NO: 84)
2.19  CACACCCGTGCATAAGGTATTCCCGCGATGACCGAGAATT  SEQ ID NO: 41  CGGGAATACCTTATGCACGGG (SEQ ID NO: 85)
2.20  AGGCCGCTGGCACAGTATATTCGCGGCGGTCAGACAATAT  SEQ ID NO: 42  GCGAATATACTGTGCCAGCGG (SEQ ID NO: 86)

1. A method of marking a plurality of biological samples with a detectable marker, said method comprising: obtaining a plurality of different nucleic acid marker molecules, where each marker molecule comprises a different nucleic acid tag sequence; obtaining a plurality of biological samples; adding an aliquot of each of at least 2 of the marker molecules to each of the biological samples to generate a plurality of barcoded biological samples, wherein each of the barcoded biological samples in the plurality comprises a different combination of marker molecules.
 2. The method of claim 1 wherein each different marker molecule is a marker plasmid, wherein each marker plasmid comprises a vector sequence and an insert comprising a tag sequence of at least 40 base pairs.
 3. The method of claim 2 wherein the vector sequence is selected from a subsequence of at least 4,000 contiguous bases of SEQ ID NO. 43 or at least 3,500 contiguous bases of SEQ ID NO. 45.
 4. The method of claim 1 wherein a detectable marker comprises a combination of tag sequences wherein each tag sequence is detectable by hybridization to a plurality of probes.
 5. The method of claim 1 wherein a tag sequence comprises between 10 and 100 contiguous bases that are not found in the human genome.
 6. The method of claim 1 wherein a tag sequence comprises between 10 and 100 contiguous bases that are not found in the human, mouse, yeast, or rat genomes.
 7. The method of claim 1 wherein each tag sequence comprises between 10 and 100 contiguous bases not found in available databases of naturally occurring genomic sequence.
 8. The method of claim 1 wherein each of the nucleic acid marker molecules comprises a double stranded region comprising two priming sites flanking a tag sequence, wherein the tag sequence is between 20 and 200 bases.
 9. A method of identifying a biological sample marked with a detectable barcode marker, according to the method of claim 1, comprising: fragmenting the biological sample with a restriction enzyme to generate restriction fragments; ligating an adaptor to the restriction fragments to generate adaptor-ligated fragments; amplifying the adaptor-ligated fragments using a primer that is complementary to the adaptor; labeling the amplified fragments with a detectable label; hybridizing the labeled fragments to an array of probes wherein the array comprises probes that are complementary to different tag sequences in the plurality of marker molecules; analyzing the hybridization pattern to identify which tag sequences are present in the sample; determining the barcode present in the sample, wherein the barcode is the combination of tag sequences that are present in the sample; and determining the identity of the sample from the barcode.
 10. The method of claim 9 wherein the array further comprises a plurality of allele specific genotyping probes and the hybridization pattern is further analyzed to determine the genotype of a plurality of single nucleotide polymorphisms.
 11. The method of claim 10 wherein the plurality of single nucleotide polymorphisms comprises more than 10,000 human single nucleotide polymorphisms.
 12. The method of claim 10 wherein the plurality of single nucleotide polymorphisms comprises more than 500,000 human single nucleotide polymorphisms.
 13. The method of claim 9 wherein the barcode comprises 2 different tag sequences.
 14. The method of claim 9 wherein the barcode comprises 3 different tag sequences.
 15. A method of marking each sample in a plurality of biological samples so that each sample is marked with a detectable barcode marker that is different from the barcode marker of each of the other samples in the plurality, said method comprising: putting an aliquot of each sample in a different well of a multi-well plate comprising 12 columns and 8 rows; obtaining 12 different first marker molecules and 8 different second marker molecules; putting an aliquot of one of the first marker molecules into each well of each column so that each column has a different first marker molecule and all wells in a column have the same first marker molecule; and, putting an aliquot of one of the second marker molecules into each well of each row so that each row has a different second marker molecule and all wells in a row have the same second marker molecule.
 16. The method of claim 15 wherein each different first marker molecule and each different second marker molecule contains a different tag sequence.
 17. The method of claim 16 wherein each different tag sequence is flanked by a first and second restriction site for a first restriction enzyme.
 18. A method for marking a plurality of X genomic samples using Y different independent marker molecules, where Y is less than X, comprising: obtaining a plurality of X genomic DNA samples; adding an aliquot of each of two of said Y different independent marker molecules to each of the X genomic DNA samples so that no two of the X genomic DNA samples has the same combination of independent marker molecules added.
 19. The method of claim 18 wherein each independent marker molecule comprises a plasmid vector backbone portion and an insert portion wherein the insert portion comprises at least one 20 base pair tag sequence.
 20. The method of claim 19 wherein each independent marker molecule comprises a gene that confers a selectable phenotype.
 21. The method of claim 18 wherein each independent marker molecule has at least two restriction sites for a restriction enzyme, wherein digestion of the marker sequence with the restriction enzyme generates a restriction fragment that comprises the tag sequence and is between 200 and 1000 base pairs.
 22. The method of claim 18 wherein X is greater than 90 and Y is between 10 and 30.
 23. A method of detecting contamination of a first sample with a second sample, wherein the first sample is marked with a first barcode and the second sample is marked with a second barcode, wherein a barcode comprises a known combination of 2 to 5 tag sequences and said first barcode and said second barcode are different: fragmenting the first sample with a restriction enzyme to generate restriction fragments; ligating an adaptor to the restriction fragments to generate adaptor-ligated fragments; amplifying the adaptor-ligated fragments by polymerase chain reaction using a primer complementary to the adaptor to generate amplified fragments; labeling the amplified fragments; generating a hybridization pattern for said first sample by hybridizing the labeled fragments to an array of probes comprising probes complementary to said tag sequences; analyzing the hybridization pattern to determine which tag sequences are present in said first sample; and, determining that said first sample is contaminated with said second sample if the barcode of the second sample is detected in the first sample.
 24. The method of claim 23 wherein the restriction enzyme is selected from the group consisting of Xba I, Sty I, Nsp I, Hind III and Eco RI.
 25. The method of claim 23 wherein the array of probes comprises at least 10,000 different probes each present in a different feature of the array.
 26. The method of claim 25 wherein the array is a genotyping array, comprising allele specific probes complementary to known human single nucleotide polymorphisms.
 27. A method of determining the genotype of a sample at a plurality of single nucleotide polymorphisms and the identity of the sample, comprising: marking the sample to be analyzed with a barcode to generate a marked sample, wherein the barcode comprises a known combination of marker molecules each carrying a detectable tag sequence; fragmenting an aliquot of the marked sample with a restriction enzyme to generate restriction fragments; ligating adaptors to the restriction fragments to generate adaptor-ligated fragments; amplifying the adaptor-ligated fragments; labeling the amplified fragments and hybridizing the labeled fragments to an array, wherein the array comprises genotyping probes and probes complementary to tag sequences to generate a hybridization pattern; and, analyzing the hybridization pattern to determine the genotype of the sample at said plurality of single nucleotide polymorphisms and to determine the barcode.
 28. The method of claim 27 wherein the known combination of marker molecules comprises 2 different marker molecules.
 29. The method of claim 27 wherein the known combination of marker molecules comprises 3 different marker molecules.
 30. The method of claim 27 wherein the known combination of marker molecules comprises 4 different marker molecules.
 31. A kit comprising a plurality of at least 10 different nucleic acid marker molecules wherein each marker molecule is physically separated from every other marker molecule and each comprises a different tag sequence.
 32. The kit of claim 31 comprising 10 to 20 different marker molecules.
 33. The kit of claim 31 comprising 20 to 50 different marker molecules.
 34. The kit of claim 31 wherein each different marker molecule comprises a tag sequence selected from SEQ ID NOS. 1-20 and 23-42 cloned into one or more restriction sites of SEQ ID NO. 43 or SEQ ID NO. 45.
 35. The kit of claim 31 wherein each different marker molecule comprises a fragment comprising a different tag sequence wherein the fragment is cloned into the XhoI and NheI sites of SEQ ID NO. 43 or the XhoI and NheI sites of SEQ ID NO. 45.
 36. A kit comprising a plurality of at least 20 different nucleic acid marker molecules wherein the marker molecules are provided in a multiwell container, so that different combinations of at least two marker molecules are present in each of a plurality of wells.
 37. The kit of claim 36 wherein the multiwell container is a multiwell plate.
 38. The kit of claim 36 wherein the multiwell plate comprises 96 or 384 wells and each well contains a different combination of marker molecules.
 39. A kit comprising a plurality of barcode plasmids wherein each barcode plasmid comprises a different tag sequence, wherein cleavage of each barcode plasmid in the plurality with a selected restriction fragment releases the tag sequence on a restriction fragment that is between 200 and 2000 base pairs in length.
 40. The kit of claim 39 wherein the restriction fragment is between 300 and 1000 base pairs.
 41. The kit of claim 39 wherein the restriction fragment is between 400 and 800 base pairs.
 42. The kit of claim 39 wherein the plurality of barcode plasmids comprises at least 10 different plasmids and wherein the restriction enzyme is selected from the group consisting of Sty I, Nsp I, XbaI, HindIII and EcoRI.